Coalesce

Spark RDD coalesce() is used only to reduce the number of partitions. It is an optimized version of repartition(): because coalesce() merges existing partitions rather than reshuffling everything, far less data moves across the cluster.
  • Coalesce uses existing partitions to minimize the amount of data that is shuffled.
  • Coalesce does not do a full shuffle of the data; it merges partitions, reducing the partition count below the original.
  • Coalesce can only decrease the number of partitions, and the resulting partitions may be uneven in size.
  • If you go from 1000 partitions to 100 partitions there is no shuffle; instead, each of the 100 new partitions claims 10 of the current partitions. You can verify this in the shell, as shown below.
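
One way to see the difference is to compare the RDD lineage produced by coalesce() against repartition() using toDebugString. This is a minimal spark-shell sketch; the RDD size and partition counts are just for illustration:

// repartition() inserts a shuffle stage; coalesce() merges partitions in place.
val bigRdd = sc.parallelize(1 to 100000, 1000)

bigRdd.coalesce(100).getNumPartitions      // 100, no shuffle
bigRdd.repartition(100).getNumPartitions   // 100, but via a full shuffle

// The repartition() lineage shows an extra ShuffledRDD stage that
// the coalesce() lineage does not have:
println(bigRdd.coalesce(100).toDebugString)
println(bigRdd.repartition(100).toDebugString)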
Case:-1
Consider a situation like this: you create an RDD with N partitions and then apply a filter transformation to it. Spark applies the transformation partition by partition, so even if the data inside a partition is filtered out completely, Spark still keeps the number of partitions the same as when the RDD was initially created. This is true for all narrow transformations (transformations where no shuffling is required).

val rdd = sc.parallelize(1 to 4)
rdd.getNumPartitions        // 8 here -- the default parallelism, which depends on your cores/cluster

val rdd1 = rdd.filter(x => x % 2 == 0)
rdd1.collect                // Array(2, 4)

rdd1.getNumPartitions       // still 8 -- filter is a narrow transformation


In the above case, you can see that we created an RDD containing 1 to 4, and on this machine it gets 8 partitions. After applying the filter transformation the number of partitions is still 8, which means some of the partitions are now empty. In these situations you can use coalesce() to reduce the number of partitions, as shown below.

val rdd2 = rdd1.coalesce(2)
rdd2.getNumPartitions       // 2

rdd2.collect                // Array(2, 4) -- same data, fewer partitions

Case:-2
As we know, Spark has two kinds of operations: wide (e.g. join and the *ByKey operations) and narrow (e.g. map, flatMap, filter). Suppose you have 2 TB of data divided into 164 MB blocks and you apply narrow operations on it again and again; you eventually end up with many small, sparsely filled partitions. If you then run a shuffle operation, performance degrades because all of those small pieces have to travel between the nodes. That is the reason to coalesce before doing the shuffle operation, as sketched below.
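
Here is a minimal sketch of that pattern; the input path, the filters, and the target partition count (200) are hypothetical placeholders:

// Hypothetical pipeline: narrow transformations shrink the data, so we
// coalesce the sparse partitions before the shuffle.
val raw = sc.textFile("hdfs:///data/big-dataset")   // placeholder path, e.g. ~2 TB

val filtered = raw
  .filter(_.nonEmpty)             // narrow: partition count unchanged
  .filter(_.startsWith("INFO"))   // narrow: partitions may now be nearly empty

// Merge the sparse partitions first, then shuffle fewer, fuller ones.
val compacted = filtered.coalesce(200)

val counts = compacted
  .map(line => (line.split(" ")(0), 1))
  .reduceByKey(_ + _)             // wide: this is the shuffle we prepared for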

Case:-3
While operating on a DataFrame and saving it as CSV, the output consists of multiple files because the DataFrame has many partitions; coalescing it into 1 partition before writing results in a single CSV file.
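
A minimal sketch of this, where df and the output paths are placeholders:

// Without coalesce, each partition writes its own part-*.csv file.
df.write.csv("/tmp/output-many-files")

// Coalescing to one partition yields a single part file (plus _SUCCESS),
// at the cost of funnelling all the data through one task.
df.coalesce(1).write.csv("/tmp/output-single-file")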
